Transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2010.
Converting the data from Excel to RDS for a more lightweight format:
#data <- readxl::read_xlsx("../online_retail_II.xlsx")
#saveRDS(data, file = "online_retail_II.rds")
Reading data.
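The readRDS call and the library setup are not shown in this report; a plausible version (package list inferred from the functions used later, file path assumed) would be:

```r
# Packages inferred from the functions used throughout this report
library(dplyr)        # pipelines
library(janitor)      # clean_names()
library(forcats)      # fct_explicit_na()
library(lubridate)    # year(), month(), day()
library(ggplot2)
library(highcharter)  # hchart()
library(TSstudio)     # ts_plot()
library(cluster)      # daisy(), pam()
library(Rtsne)        # Rtsne()

# Read back the lightweight RDS created above (path is an assumption)
data <- readRDS("online_retail_II.rds")
```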
colnames(data)
## [1] "Invoice" "StockCode" "Description" "Quantity" "InvoiceDate"
## [6] "Price" "Customer ID" "Country"
InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter ‘C’, it indicates a cancellation.
StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
UnitPrice: Unit price. Numeric. Product price per unit in sterling (£). Named Price in this file.
CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal. The name of the country where a customer resides.
# Some basic cleaning; converting these columns to factors makes ratio summaries easier to read
data <- data %>% clean_names()
data$description <- as.factor(data$description)
data$description <- fct_explicit_na(data$description)
data$country <- as.factor(data$country)
data$customer_id <- fct_explicit_na(as.factor(data$customer_id))
data$stock_code <- fct_explicit_na(as.factor(data$stock_code))
summary(data)
## invoice stock_code
## Length:525461 85123A : 3516
## Class :character 22423 : 2221
## Mode :character 85099B : 2057
## 21212 : 1933
## 21232 : 1843
## 20725 : 1620
## (Other):512271
## description quantity
## WHITE HANGING HEART T-LIGHT HOLDER: 3549 Min. :-9600.00
## (Missing) : 2928 1st Qu.: 1.00
## REGENCY CAKESTAND 3 TIER : 2212 Median : 3.00
## STRAWBERRY CERAMIC TRINKET BOX : 1843 Mean : 10.34
## PACK OF 72 RETRO SPOT CAKE CASES : 1466 3rd Qu.: 10.00
## ASSORTED COLOUR BIRD ORNAMENT : 1457 Max. :19152.00
## (Other) :512006
## invoice_date price customer_id
## Min. :2009-12-01 07:45:00 Min. :-53594.36 (Missing):107927
## 1st Qu.:2010-03-21 12:20:00 1st Qu.: 1.25 14911 : 5710
## Median :2010-07-06 09:51:00 Median : 2.10 17841 : 5114
## Mean :2010-06-28 11:37:36 Mean : 4.69 14606 : 3927
## 3rd Qu.:2010-10-15 12:45:00 3rd Qu.: 4.21 14156 : 2710
## Max. :2010-12-09 20:01:00 Max. : 25111.09 12748 : 2665
## (Other) :397408
## country
## United Kingdom:485852
## EIRE : 9670
## Germany : 8129
## France : 5772
## Netherlands : 2769
## Spain : 1278
## (Other) : 11991
tail(data)
## # A tibble: 6 x 8
## invoice stock_code description quantity invoice_date price customer_id
## <chr> <fct> <fct> <dbl> <dttm> <dbl> <fct>
## 1 538171 20971 PINK BLUE … 2 2010-12-09 20:01:00 1.25 17530
## 2 538171 22271 FELTCRAFT … 2 2010-12-09 20:01:00 2.95 17530
## 3 538171 22750 FELTCRAFT … 1 2010-12-09 20:01:00 3.75 17530
## 4 538171 22751 FELTCRAFT … 1 2010-12-09 20:01:00 3.75 17530
## 5 538171 20970 PINK FLORA… 2 2010-12-09 20:01:00 3.75 17530
## 6 538171 21931 JUMBO STOR… 2 2010-12-09 20:01:00 1.95 17530
## # … with 1 more variable: country <fct>
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 525461 obs. of 8 variables:
## $ invoice : chr "489434" "489434" "489434" "489434" ...
## $ stock_code : Factor w/ 4631 levels "10002","10002R",..: 3892 3052 3054 1394 709 1417 1278 972 1690 1689 ...
## $ description : Factor w/ 4644 levels "*Boombox Ipod Classic",..: 31 2943 4431 3259 4107 2963 3618 1492 829 1230 ...
## $ quantity : num 12 12 12 48 24 24 24 10 12 12 ...
## $ invoice_date: POSIXct, format: "2009-12-01 07:45:00" "2009-12-01 07:45:00" ...
## $ price : num 6.95 6.75 6.75 2.1 1.25 1.65 1.25 5.95 2.55 3.75 ...
## $ customer_id : Factor w/ 4384 levels "12346","12347",..: 508 508 508 508 508 508 508 508 508 508 ...
## $ country : Factor w/ 40 levels "Australia","Austria",..: 37 37 37 37 37 37 37 37 37 37 ...
#data explorer report
#DataExplorer::create_report(data)
# Here we notice that quantity can be negative (cancellations), so it is a good idea to create a total_price value that multiplies quantity * price
data %>% filter(startsWith(invoice,"C")) %>% filter(stock_code == "22087")
## # A tibble: 3 x 8
## invoice stock_code description quantity invoice_date price customer_id
## <chr> <fct> <fct> <dbl> <dttm> <dbl> <fct>
## 1 C489449 22087 PAPER BUNT… -12 2009-12-01 10:33:00 2.95 16321
## 2 C533899 22087 PAPER BUNT… -6 2010-11-19 11:50:00 2.95 13993
## 3 C536164 22087 PAPER BUNT… -65 2010-11-30 12:09:00 2.9 15369
## # … with 1 more variable: country <fct>
data <- data %>% mutate(total_price = quantity * price)
What we can see here is that:
- in December 2009, there were sales up until the 23rd of December;
- in December 2010, sales were made only until the 9th of December, which suggests data may be missing for the last days of 2010;
- we can also deduce that this online retailer closes from Christmas until January.
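The two series plotted below (decembercomparison2009 and decembercomparison2010) are not constructed anywhere in this report; a plausible construction, assuming they are daily total-sales data frames in a shape TSstudio::ts_plot accepts, would be:

```r
# Hypothetical construction of the December daily-sales series
# (the original objects are not shown in the report)
december_daily <- function(yr) {
  data %>%
    filter(year(invoice_date) == yr, month(invoice_date) == 12) %>%
    mutate(date = as.Date(invoice_date)) %>%
    group_by(date) %>%
    summarise(total_sales = sum(total_price)) %>%
    as.data.frame()
}
decembercomparison2009 <- december_daily(2009)
decembercomparison2010 <- december_daily(2010)
```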
ts_plot(decembercomparison2009, Ytitle = "Total Sales", line.mode = "lines+markers")
ts_plot(decembercomparison2010, Ytitle = "Total Sales", line.mode = "lines+markers")
Here we see the number of distinct items per invoice per day for December 2009.
fntltp <- JS("function(){
return this.point.x + ' ' + this.series.yAxis.categories[this.point.y] + ':<br>' +
Highcharts.numberFormat(this.point.value, 2);
}")
data %>%
  filter(month(invoice_date) == 12 & year(invoice_date) %in% c(2009)) %>%
  mutate(day = lubridate::day(invoice_date), year = year(invoice_date)) %>%
  group_by(invoice, as.factor(day)) %>%
  mutate(sales_per_invoice = sum(total_price)) %>%
  ungroup() %>%
  group_by(day) %>%
  count(invoice) %>%
  arrange(day, -n) %>%
  ungroup() %>%
  hchart("heatmap", hcaes(x = day, y = invoice, value = n)) %>%
hc_yAxis(reversed = TRUE, offset = -20, tickLength = 0,
gridLineWidth = 0, minorGridLineWidth = 0,
labels = list(style = list(fontSize = "8px"))) %>%
hc_tooltip(formatter = fntltp) %>%
hc_size(height = 1000)
data %>%
  filter(month(invoice_date) == 12 & year(invoice_date) %in% c(2010)) %>%
  mutate(day = lubridate::day(invoice_date), year = year(invoice_date)) %>%
  group_by(invoice, as.factor(day)) %>%
  mutate(sales_per_invoice = sum(total_price)) %>%
  ungroup() %>%
  group_by(day) %>%
  count(invoice) %>%
  arrange(day, -n) %>%
  ungroup() %>%
  hchart("heatmap", hcaes(x = day, y = invoice, value = n)) %>%
hc_yAxis(reversed = TRUE, offset = -20, tickLength = 0,
gridLineWidth = 0, minorGridLineWidth = 0,
labels = list(style = list(fontSize = "8px"))) %>%
hc_tooltip(formatter = fntltp) %>%
hc_size(height = 1000)
data %>%
  filter(month(invoice_date) == 12 & year(invoice_date) %in% c(2009, 2010)) %>%
  mutate(day = lubridate::day(invoice_date), year = year(invoice_date)) %>%
  group_by(customer_id, as.factor(day), year) %>%
  summarise(sales_per_invoice = sum(total_price)) %>%
  ungroup() %>%
  arrange(-sales_per_invoice) %>%
  head(20)
## # A tibble: 20 x 4
## customer_id `as.factor(day)` year sales_per_invoice
## <fct> <fct> <dbl> <dbl>
## 1 18102 7 2010 25920.
## 2 (Missing) 6 2010 23395.
## 3 (Missing) 3 2010 23022
## 4 (Missing) 14 2009 21913.
## 5 18102 3 2009 18475
## 6 (Missing) 22 2009 17307.
## 7 (Missing) 16 2009 15618.
## 8 (Missing) 9 2010 15354.
## 9 13694 14 2009 14190.
## 10 (Missing) 1 2010 12584.
## 11 (Missing) 7 2009 11914.
## 12 18102 11 2009 11016
## 13 (Missing) 18 2009 10895.
## 14 (Missing) 1 2009 10465.
## 15 (Missing) 2 2009 10185.
## 16 (Missing) 9 2009 9502.
## 17 15061 2 2010 9407.
## 18 14646 18 2009 8370.
## 19 14156 21 2009 7113.
## 20 (Missing) 4 2009 6660.
data %>%
  filter(month(invoice_date) == 12 & year(invoice_date) %in% c(2009, 2010)) %>%
  mutate(day = lubridate::day(invoice_date), year = year(invoice_date)) %>%
  group_by(invoice, as.factor(day)) %>%
  mutate(sales_per_invoice = sum(total_price)) %>%
  ungroup() %>%
  count(description) %>%
  arrange(-n) %>%
  head(10) %>%
hchart(type = "bar", hcaes(x = description, y = n))
head(summary(data$description), 20)
## WHITE HANGING HEART T-LIGHT HOLDER (Missing)
## 3549 2928
## REGENCY CAKESTAND 3 TIER STRAWBERRY CERAMIC TRINKET BOX
## 2212 1843
## PACK OF 72 RETRO SPOT CAKE CASES ASSORTED COLOUR BIRD ORNAMENT
## 1466 1457
## 60 TEATIME FAIRY CAKE CASES HOME BUILDING BLOCK WORD
## 1400 1386
## JUMBO BAG RED RETROSPOT LUNCH BAG RED SPOTTY
## 1310 1274
## REX CASH+CARRY JUMBO SHOPPER JUMBO STORAGE BAG SUKI
## 1232 1220
## PACK OF 60 PINK PAISLEY CAKE CASES WOODEN FRAME ANTIQUE WHITE
## 1196 1190
## LUNCH BAG BLACK SKULL. BAKING SET 9 PIECE RETROSPOT
## 1179 1176
## LUNCH BAG SUKI DESIGN HEART OF WICKER LARGE
## 1157 1151
## LOVE BUILDING BLOCK WORD RED HANGING HEART T-LIGHT HOLDER
## 1142 1129
Most similar clients:
#' Compute Gower distance on the December 2010 data only, because computing it on the full dataset is time-consuming
concentrationdf <- data %>%
  group_by(customer_id, country) %>%
  filter(month(invoice_date) == 12 & year(invoice_date) %in% c(2010)) %>%
  select(-invoice_date, -price, -quantity) %>%
  summarise(sum_total_price = sum(total_price))
#concentrationdf$invoice <- as.factor(concentrationdf$invoice)
gower_dist <- daisy(concentrationdf, metric = "gower")
gower_mat <- as.matrix(gower_dist)
#' Print most similar clients
concentrationdf[which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), arr.ind = TRUE)[1, ], ]
## # A tibble: 2 x 3
## # Groups: customer_id [2]
## customer_id country sum_total_price
## <fct> <fct> <dbl>
## 1 13402 United Kingdom 308.
## 2 13369 United Kingdom 308.
Most dissimilar clients:
#' Print most dissimilar clients
concentrationdf[which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), arr.ind = TRUE)[1, ], ]
## # A tibble: 2 x 3
## # Groups: customer_id [2]
## customer_id country sum_total_price
## <fct> <fct> <dbl>
## 1 (Missing) United Kingdom 72316.
## 2 12721 France -27.2
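Gower distance works on mixed data by rescaling each variable's dissimilarity to [0, 1] and averaging across variables, which is why daisy() can compare the factor columns and the numeric spend at once. A toy illustration (not the report's data):

```r
library(cluster)
# Three hypothetical customers: two UK customers with similar spend,
# one French customer with very different spend
toy <- data.frame(country = factor(c("UK", "UK", "France")),
                  spend   = c(100, 110, 5000))
daisy(toy, metric = "gower")
# the 1-2 dissimilarity is near 0; pairs involving row 3 are near 1
```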
sil_width <- c(NA)
for(i in 2:15){
pam_fit <- pam(gower_dist, diss = TRUE, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
plot(1:15, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:15, sil_width)
Here we can see the results of the clustering:
k <- 10
pam_fit <- pam(gower_dist, diss = TRUE, k)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering))
ggplot(aes(x = X, y = Y), data = tsne_data) +
geom_point(aes(color = cluster))
We used Gower distance to find similar customers and found a clear pattern in one of the groups; the remaining nine clusters are less well separated. The clustering needs more tuning, since several variables had to be left out due to hard-drive size limitations.
We have entries from 40 different countries.
countryfreq <- summary(data$country)
countries <- stack(countryfreq)
countries %>% rename(country = ind, n = values) %>% arrange(-n)
## n country
## 1 485852 United Kingdom
## 2 9670 EIRE
## 3 8129 Germany
## 4 5772 France
## 5 2769 Netherlands
## 6 1278 Spain
## 7 1187 Switzerland
## 8 1101 Portugal
## 9 1054 Belgium
## 10 906 Channel Islands
## 11 902 Sweden
## 12 731 Italy
## 13 654 Australia
## 14 554 Cyprus
## 15 537 Austria
## 16 517 Greece
## 17 432 United Arab Emirates
## 18 428 Denmark
## 19 369 Norway
## 20 354 Finland
## 21 310 Unspecified
## 22 244 USA
## 23 224 Japan
## 24 194 Poland
## 25 172 Malta
## 26 154 Lithuania
## 27 117 Singapore
## 28 111 RSA
## 29 107 Bahrain
## 30 77 Canada
## 31 76 Hong Kong
## 32 76 Thailand
## 33 74 Israel
## 34 71 Iceland
## 35 63 Korea
## 36 62 Brazil
## 37 54 West Indies
## 38 34 Bermuda
## 39 32 Nigeria
## 40 13 Lebanon
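The same ranking could be obtained directly with dplyr, without stacking the factor summary (an equivalent alternative, not the code used above):

```r
# count() tallies rows per country; sort = TRUE orders by descending n
data %>% count(country, sort = TRUE)
```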
dynamics <- data %>%
  filter(day(invoice_date) == 14 & month(invoice_date) == 11 & year(invoice_date) %in% c(2010)) %>%
  select(-quantity)
# lag total_price by one row so the ribbon fills the area under the step plot
dyna_tl <- mutate( dynamics, total_price = lag( total_price ) )
dyna_all <- bind_rows( old = dynamics, new = dyna_tl, .id="source" ) %>%
arrange( invoice_date, source)
ggplot(dynamics, aes(invoice_date, total_price)) +
geom_step() +
geom_ribbon( data = dyna_all, aes( ymin = 0, ymax = total_price ),
fill="tomato", alpha=0.5 )
This can be reproduced at a larger scale by increasing the minimum number of interactions a customer must have with the store.
loyaltysubset <- data %>%
  filter(month(invoice_date) == 8 & year(invoice_date) %in% c(2010)) %>%
  select(-stock_code, -description, -invoice, -quantity, -price)
# remove customers with fewer than 200 transactions in the month
tt <- table(loyaltysubset$customer_id)
rare_levels <- names(tt)[tt<200]
loyaltysubset <- subset(loyaltysubset,!customer_id %in% rare_levels)
summary(loyaltysubset)
## invoice_date customer_id country
## Min. :2010-08-01 14:23:00 (Missing):6364 United Kingdom:7720
## 1st Qu.:2010-08-10 12:08:00 16549 : 507 EIRE : 462
## Median :2010-08-13 18:07:00 17841 : 353 Hong Kong : 2
## Mean :2010-08-16 04:23:50 14911 : 284 Australia : 0
## 3rd Qu.:2010-08-24 14:29:00 14606 : 236 Austria : 0
## Max. :2010-08-31 17:31:00 13089 : 225 Bahrain : 0
## (Other) : 215 (Other) : 0
## total_price
## Min. :-18910.69
## 1st Qu.: 2.51
## Median : 5.06
## Mean : 10.58
## 3rd Qu.: 11.82
## Max. : 2367.28
##
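Note that subset() keeps unused factor levels, which is why countries with zero rows (Australia, Austria, Bahrain, ...) still appear in the summary above; if that is unwanted, droplevels() removes them (a side note, not part of the original analysis):

```r
# drop factor levels that no longer occur after subsetting
loyaltysubset <- droplevels(loyaltysubset)
summary(loyaltysubset$country)  # now only levels that actually occur
```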
timeline_plot <- plot_timeline(loyaltysubset, save_path = "loyalty.png")
## INFO [2020-03-19 23:58:38] invoice_date has been selected as the timestamp column
## INFO [2020-03-19 23:58:38] total_price has been selected as the numeric column(s)
## INFO [2020-03-19 23:58:38] customer_id, country has been selected as the state column(s)
## INFO [2020-03-19 23:58:38] creating state plot layers
## INFO [2020-03-19 23:58:39] creating sample plot layers
## Aligning plots
## INFO [2020-03-19 23:58:39] Writing image to file as PNG in:
## Plotting
grid.draw(timeline_plot, recording = TRUE)
Timeline of loyalty subset
jan2010 <- data %>%
  filter(month(invoice_date) == 1 & year(invoice_date) == 2010) %>%
  mutate(day = lubridate::day(invoice_date), year = year(invoice_date)) %>%
  group_by(day, year) %>%
  summarise(daily_sales = sum(total_price)) %>%
  ungroup()
jan2010 <- jan2010 %>%
  mutate(date = as.Date(ymd(paste0(year, "-01-", day)))) %>%
  select(daily_sales, date)
Plot of the simple moving average (SMA) forecast:
smooth::sma(jan2010$daily_sales, h = 5, interval = TRUE, silent = FALSE)
## Time elapsed: 0.02 seconds
## Model estimated: SMA(24)
## Initial values were produced using backcasting.
##
## Loss function type: MSE; Loss function value: 183000530.6431
## Error standard deviation: 14129.2938
## Sample size: 24
## Number of estimated parameters: 2
## Number of degrees of freedom: 22
## Information criteria:
## AIC AICc BIC BICc
## 528.7090 529.2805 531.0651 531.9732
##
## 95% parametric prediction interval was constructed